Due date: 5th April 2024
Your name (UPI)
Please rename the file and add your name and UPI into the file name, and then replace the author into your name in the header above, including UPI insider the bracket.
This is an R Markdown, which further produces either a HTML or PDF file.
Please follow the provided instructions carefully to set up the analysis and answer all questions, which are numbered using Roman numerals, such as Question i:. One point for each question, and one bonus point for clarity and producing a well-written report.
Your responses to questions should be compiled into a data analysis report format. Ensure that no answer blocks are overlooked, with the exception of the example provided below. Substitute ‘Your answer’ with your actual response to each question, limiting your answers to a maximum of 150 words. For questions that necessitate a screenshot, please ensure that the image is clear and legible.
Answer starts:
Your answer
Answer ends
Ensure you answer all questions and submit your report in either HTML or PDF format, which can be generated from this markdown file. Before submitting, please verify that your report opens and displays correctly.
Additionally, it’s beneficial to include all relevant figures in your answers, along with the log files and tree files associated with your analysis (excluding the BEAST 2 posterior tree sets due to their large file sizes).
For example:
To minimize the size of your submission, consider compressing the files, ensuring to use only the ZIP format for compatibility.
Please ensure that the following software packages are installed and ready to use on your computer:
Please download the alignment of phylodynamics of the Respiratory syncytial virus subgroup A (RSV-A). RSV infects the human lower respiratory tract, causing symptoms that are often indistinguishable from the common cold.
The alignment consists of 129 molecular sequences coding for RSV-A’s G protein (Zlateva et al 2004), a glycoprotein that allows the virus to attach itself to the host cells’ membranes.
We will conduct two analyses - one employing the maximum likelihood method, and the second using Bayesian phylogenetic inference to uncover the temporal signal within the provided alignment.
Run iqtree2 on the
dataset to construct a maximum likelihood tree, using the command
iqtree -s RSV2.nex. Please attach the maximum likelihood
tree file (e.g., “RSV2.nex.treefile”) and iqtree log file (e.g.,
“RSV2.nex.log”) with the report.
For further support, please refer to the online documentation and tutorials of iqtree2.
Iqtree2 offers a wide range of DNA substitution models. Upon execution, it leverages ModelFinder to identify the optimal model for your data, subsequently building the tree using this model.
Question i: By referring to the iqtree log file (e.g., “RSV2.nex.log”), what was the “best-fit” substitution model? What was its log-likelihood? Please provide a brief, yet precise, description of the selected model.
Answer starts:
Your answer
Answer ends
Question ii: Plot the maximum likelihood tree. What are the units of branch length in this maximum likelihood tree?
Answer starts:
Your answer
Answer ends
We will now employ TempEst to analyse the temporal signal and ‘clock-likeness’ of the maximum likelihood tree generated by iqtree2. TempEst is specially designed for analysing trees that have not been inferred under a molecular-clock assumption to see how valid this assumption may be. Additionally, it identifies the optimal root of the tree at the position that is likely to be the most compatible with the assumption of the molecular clock.
If you are unsure how to answer the following questions, please refer to the article by Rambaut et al 2016.
Launch TempEst and import the maximum likelihood tree file. Record and familiarize yourself with the statistics displayed in the top left conner of the interface.
Then, select “Parse Date” and replicate the settings shown in the following image to extract years from taxon labels.
Navigate to the “Tree” tab to view the visualized tree, and proceed to the “Root-to-tip” tab to understand the implications of the regression analysis results
Next, activate the “best-fitting root” option in the top-left corner to root the tree at the position most congruent with the molecular clock hypothesis, utilizing the default “heuristic residual mean squared” function.
Report the differences between two trees according to the guidelines provided below.
Record both statistical results before and after ticking the “best-fitting root” option by text. Pick up two statistic from them, which you think they are the matter of the most. Explain their implications, and compare them with the analysis each other.
Question i:
Answer starts:
Your answer
Answer ends
After activating the “best-fitting root” option, navigate to the “Tree” tab.
Question ii:
Answer starts:
Your answer
Answer ends
Question iii:
Answer starts:
Your answer
Answer ends
As Rambaut et al 2016 demonstrated, regression of root-to-tip genetic distance against sampling time can be used as a simple diagnostic tool for molecular clock models.
Keep the same option in TempEst, navigate to the “Root-to-tip” tab, and answer the following questions.
Question iv:
Use the number and figure to support your conclusion.
Answer starts:
Your answer
Answer ends
In this section, we will estimate the evolutionary rate from alignments containing sequences isolated at various times (heterochronous or time-stamped data) using BEAST 2. BEAST 2 is a comprehensive tool for Bayesian phylogenetic analysis of molecular sequences across different platforms. It uses Markov chain Monte Carlo (MCMC) to average over tree space, so that each tree is sampled proportionally to its posterior probability.
Launch BEAUti, set up the analysis following these steps:
Upon selecting the file, the “Partitions” panel will display a single partition showing the number of taxa and sites in the alignment. To examine the alignment more closely, double-click the partition to open the alignment viewer.
The gene in the dataset encodes a protein, which means its sequence is composed of triplets, or codons, each translating into an amino acid. Crucially, it’s established that different codon positions evolve at varying rates due to the structure of the ‘genetic code’ (Read Bofkin & Goldman 2007 for details). By incorporating this knowledge into the anaysis, we will divide the alignment into three partitions that correspond to each codon position, thereby accounting for the varied evolutionary rates across these positions.
This action is based on our gene sequence beginning with its first complete codon at the third nucleotide of the alignment. As a result, this process generates three distinct rows in the partitions panel, each representing one of the three codon positions.
Question i: For this analysis, please elucidate the rationale behind assigning distinct substitution models to each partition, and clarify the implications of linking clock models and trees from a phylogenetic perspective.
Answer starts:
Your answer
Answer ends
_ with
s, select “Add fixed value to 1900”, and click “OK”.This action will extract the years from the taxa names and append them to the “Date (raw value)” column in the table. Verify if these years are accurately parsed.
You may notice an orange warning dot at the bottom of the panel initially. However, after setting all three substitution rates to estimate, the warning will disappear, and the option to “fix mean mutation rate” will be triggered.
To streamline this process, utilize the “Clone” function to replicate
the configuration. Hold the shift key to select all site
models on the left side, then click “OK” to clone the settings from a
selected site model (the first one). Afterwards, review each site model
to ensure their configurations are identical.
The primary goal here is to configure the analysis to estimate the relative substitution rates of codons, which are relative to a general rate defined in the molecular clock model.
Question ii: Considering the principles of phylogenetics and insights gleaned from the previous TempEst analysis, please explain why we can estimate substitution rates in this case.
Answer starts:
Your answer
Answer ends
0.001.Question iii:
Answer starts:
Your answer
Answer ends
To set up the priors, select the “Priors” tab. We are going to choose a simple tree prior for this analysis, so select “Coalescent Constant Population”.
For our molecular clock model, we will set the prior on the
“clockRate” parameter to a log-normal distribution with mean of
-5, and standard deviation of 1.25 (M=-5 and
S=1.25). The plot of this prior distribution and its quantiles can be
visualised on the right side.
To configure the priors, navigate to the “Priors” tab. For this analysis, opt for a straightforward tree prior by choosing “Coalescent Constant Population.”
In setting up our molecular clock model, assign a log-normal
distribution as the prior for the “clockRate” parameter, specifying a
mean of -5 and a standard deviation of 1.25
(M=-5, S=1.25). You can examine the shape and quantiles of this
distribution in the visual plot on the right.
Select the MCMC tab, adjust the “Chain Length” to
20000000 (20 million), establishing the total number of
iterations for the Markov chain. For the “tracelog” and “treelog”, set
the “Log Every” to 10000 (10 thousand). This logging
frequency ensures both trace and tree log files are synchronized,
capturing 2000 samples each for subsequent analysis. Modify the
“screenlog” setting, by changing the “Log Every” to 100000
(100 thousand). This adjustment will minimize the volume of outputs
displayed on the screen.
After implementing these adjustments, proceed to save your configuration by exporting it as an XML file, e.g. RSV2.xml, by clicking “Save” in the “File” menu.
To analyse the results, run the BEAST 2 XML file twice with different seeds, and output files in separate sub-directories to avoid overwriting. We recommend you create two sub-folders “beast-run1” and “beast-run2”, then run the XML from each sub-folder (e.g. beast-run1). This will use the folder as a working directory and output all files there.
Launch “BEAST”, choose your XML, and click “Run”. Then, it will start a new run. If you want to continue your previous BEAST 2 run, you need to change the drop-down list into “resume: appends log to the existing files (if any)”. Remember to check the seed in each screen output, making sure they are different between two runs.
Alternatively, you can run it from the command line
YOUR_PATH/bin/beast -beagle_SSE ../RSV2.xml. It is optional
to use BEAGLE. If you do not
have it installed, you can remove -beagle_SSE, which would
not affect the result. For resuming, use the command
YOUR_PATH/bin/beast -beagle_SSE -resume ../RSV2.xml.
For this analysis, BEAST 2 produces two types of output files:
To assess the convergence of the MCMC runs, indicating that the phylogenetic MCMC calculation has reached equilibrium, we utilize “Tracer.” Open Tracer, then import the two BEAST log files, named “RSV2.log,” by dragging and dropping them into the left-side panel sequentially.
One approach is to compute ESS, which Tracer has done for you. In addition, you can navigate to the “Trace” tab in the upper-right corner, to the visualisation of the sampled values against the step in the MCMC chain. It will help you see if the chain is mixing well and to what extent the samples are correlated.
The another approach is to make cross-chain comparisons by running
multiple independently initialized chains. Tracer offers simple
visualisation to view the statistics and traces from multiple files. In
the “Trace Files” table, hold the shift and click to select
both log files. Please do not select the combined trace. Then,
single-click each parameter, you can see the combined plots, such as
statistics and traces.
Question iv:
Answer starts:
Your answer
Answer ends
After the MCMC runs have converged, we are ready to present the results. Given the simplicity of our dataset, we will proceed with the analysis using a single log file.
In the “Trace Files” table, click once on the first log file to
select it. Then, using the shift key, select the three
“mutationRate” parameters simultaneously. Navigate to the “Marginal
Density” tab, and position the legend in the “Top-Right” corner.
Question v:
Answer starts:
Your answer
Answer ends
Question vi:
Answer starts:
Your answer
Answer ends
We will use the program “TreeAnnotator” to summarise a maximum clade credibility (MCC) tree from a posterior tree set (logged in the tree file). TreeAnnotator is an application that comes with BEAST. First of all, we keep the burn-in percentage to 10(%), and the target tree type to “Maximum clade credibility tree”. Then we select the “Node heights” to “Median heights”, and provide input and output file names. This will rescale the node height to reflect the posterior median node heights for the clades contained in the MCC tree. More details about settings are available from here.
The visualization tools like “FigTree” can be used to visualize the MCC (Maximum Clade Credibility) tree, where some meta-data has been annotated by TreeAnnotator.
Navigate to the “Trees” section in the left panel and enable the “Order nodes” check-box. This will organize the tips alphabetically or numerically by their labels. Proceed to the “Node Labels” section, expand it, and use the “Display:” drop-down list to select the specific statistics you wish to show on the tree nodes for a comprehensive post-analysis.
Here is the screenshot for FigTree plotting the HPD interval of the time of MRCA at each internal node:
Question vii:
Answer starts:
Your answer
Answer ends
Refer to the original publication Zlateva et al 2004 that provides the dataset context. Your next task involves analysing the MCC tree to identify clades corresponding to the five primary subgroups detailed in the study: GA1, GA2, GA4, GA5, and BE/A1, as depicted in Figure 1 of the publication. Please exclude the three sequences labelled USALongs56, AUSA2s61, and S2s76 from your clade selections as these are considered as outgroup sequences for this analysis.
Question viii:
Answer starts:
Your answer
Answer ends
Question ix:
Answer starts:
Your answer
Answer ends